Skip to content

๐Ÿ›๏ธ M00: Core Prompting & Context Engineering โ€‹

This foundational module covers the physical, economic, and computational constraints of prompting LLMs in production systems. You will learn to optimize context windows, configure model parameters, and prevent runtime failures.


๐Ÿ›๏ธ 1. Architectural Deep Dive: Attention & KV Caching โ€‹

When designing production agent loops, prompts are not simple strings. They represent input matrices processed by transformer attention mechanisms.

KV Cache & Context Attention โ€‹

During inference, the model stores key-value (KV) states of previous tokens in memory (KV Cache) to avoid re-evaluating context at every token generation step.

  • VRAM Overhead: The size of the KV Cache scales linearly with context length and number of concurrent requests. Large contexts pressure GPU VRAM, increasing latency and TTFT (Time To First Token).
  • Attention Drift: As context length grows, the model's self-attention score spreads thin, causing it to ignore instructions or constraints placed in the middle of the prompt (the "Lost in the Middle" phenomenon).

Prompt Caching Internals โ€‹

To mitigate KV Cache overhead, providers like Google Gemini and Anthropic Claude cache identical leading context headers at the API level.

  • Cache Hits: Caching is triggered for prefixes exceeding a minimum size (e.g. 1,024 tokens for Claude, 32,768 tokens for Gemini).
  • Byte-for-Byte Match: The cached prefix (system prompts, database schemas, code libraries) must be exactly identical. A single character change (including whitespace) invalidates the cache.
  • Token Economics: Cached input tokens are charged at a significantly reduced rate (up to 90% cheaper than standard input tokens).

๐Ÿ“Š 2. Tradeoff Matrix: Context Optimization Methods โ€‹

MethodLatency (TTFT)VRAM FootprintToken CostOutput ConsistencyPrimary Production Bottleneck
Zero-Shot PromptingUltra-Low (< 200ms)NegligibleVery LowBrittle / LowHallucinations on structured formats
Few-Shot XML PromptingModerate (~500ms)LowLowVery HighToken inflation from repetitive examples
Context CachingLow (after 1st run)High (GCP managed)Ultra-Low (90% off)HighCache eviction cycles on long idle states
Context PruningModerateLowLowHighInformation loss from aggressive compression

๐Ÿ› ๏ธ 3. Step-by-Step Mechanics: Structured Prompts & Tuning โ€‹

To write deterministic code interfaces, we use a structured format combining XML tags and Low-Temperature Parameter Tuning.

1. XML Structured Markup โ€‹

XML tags act as clear boundaries, preventing the LLM from confusing system instructions with user-submitted data payloads:

xml
<system_instructions>
You are an expert PostgreSQL database administrator.
Your goal is to output SQL statements based on user schemas.
</system_instructions>

<constraints>
- Output raw SQL only.
- Do NOT include explanation blocks.
</constraints>

<schema_context>
CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT);
</schema_context>

<user_query>
Add an email column to the users table.
</user_query>

2. Parameter Tuning Configurations โ€‹

Set these parameters in your API config payloads:

  • temperature = 0.0: Forces the model to select the token with the absolute highest probability. This is mandatory for coding and JSON serialization tasks to prevent syntax formatting failures.
  • max_output_tokens: Must be set with a safety buffer (e.g. 2048 or 4096). If set too low, outputs truncate mid-sentence, causing JSON parsing crashes.
  • top_p & top_k: Set to default or 1.0 if temperature = 0.0. If tuning a reasoning agent, keep top_p at 0.95 to allow minor variation while pruning low-probability nonsense tokens.

๐Ÿ›ก๏ธ 4. Failure Mode Analysis: Mitigating Prompt Failures โ€‹

Failure ModeLog Signature / ErrorRoot CauseCode Mitigation
JSON Parse Crashjson.decoder.JSONDecodeErrorOutput truncated due to low max_output_tokens.Increase max_output_tokens or implement Pydantic validation retries.
Attention LossAgent ignores negative constraints.Context window overflow or instruction placed in the middle.Wrap target contents in XML tags; place core constraints at the end of the prompt.
Cache InvalidationIncreased token billing on consecutive runs.Prompt prefix changed (dynamic timestamps, variables, or whitespace).Place all dynamic inputs (user query, runtime variables) at the absolute bottom of the payload.
Rate LimitingResourceExhausted (429)Exceeded provider TPM/RPM constraints.Implement exponential backoff retry loops using tenacity.

๐Ÿงช 5. Runtime Verification: What to Observe โ€‹

To verify your prompt design and cache behavior:

Test 1: Cold vs. Hot Run Verification โ€‹

  1. Launch the Gemini CLI or execute a Python script loading a large (35k token) codebase context.
  2. Observe TTFT:
    • First Run (Cold): Latency will spike (~5-8s) as the API gateway compiles and caches the KV states.
    • Second Run (Hot): Latency should drop to <1.5s, indicating a successful cache hit.
  3. Audit the API request log. Confirm that the billing output logs list the cached input token counts matching your codebase context size.

Test 2: XML Parameter Tuning โ€‹

  1. Run a script that requests structured JSON using a temperature of 0.8.
  2. Perform 50 consecutive runs. Count the number of runs where the output fails to parse (e.g., trailing commas, missing brackets).
  3. Change temperature to 0.0 and repeat. Confirm that formatting errors drop to 0%.